
    Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs

    Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation results in excessive instrumentation that hinders performance. This paper addresses this issue and introduces several techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs), the inlining of data locality checks, and the use of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that are propagated through the loop body. A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase the performance of UPC programs to up to 1.8× that of their hand-optimized UPC counterparts for applications with regular accesses, and up to 6.3× for applications with irregular accesses.
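
    The inspector-executor pattern and the index vector mentioned above can be sketched in plain UPC. This is a minimal illustration under stated assumptions, not the paper's implementation: the array, the gather loop, and the buffering strategy are invented for the example, and a real runtime would batch the recorded remote indices into one packed message per owning thread rather than issuing element reads.

        #include <upc.h>

        #define N 1024
        shared double x[N * THREADS];   /* default cyclic distribution */
        int    idx[N];                  /* access pattern, filled at run time */
        double y[N];

        /* Original fine-grained loop: for (i...) y[i] = x[idx[i]]; */
        void gather_inspector_executor(void) {
            double buf[N];     /* private landing zone for the executor */
            int    remote[N];  /* index vector of accesses needing communication */
            int    nremote = 0;

            /* Inspector: classify each access once, instead of letting the
             * compiler instrument every dereference in the loop body.     */
            for (int i = 0; i < N; i++) {
                if (upc_threadof(&x[idx[i]]) == MYTHREAD)
                    buf[i] = *((double *)&x[idx[i]]);  /* local: cast and read */
                else
                    remote[nremote++] = i;             /* remote: record, fetch later */
            }

            /* Communication: a runtime would coalesce these into one packed
             * message per peer; standard UPC forces element reads here.    */
            for (int r = 0; r < nremote; r++)
                buf[remote[r]] = x[idx[remote[r]]];

            /* Executor: the original loop body now runs on private data. */
            for (int i = 0; i < N; i++)
                y[i] = buf[i];
        }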

    Automatic communication coalescing for irregular computations in UPC language

    Partitioned Global Address Space (PGAS) languages appeared to address programmer productivity on large-scale parallel machines. However, fine-grained accesses to shared structures have been identified as one of the main bottlenecks of PGAS languages. Manual or compiler-assisted code optimization is required to avoid fine-grained accesses. The downside of manually applying code transformations is the increased program complexity, which hinders programmer productivity. On the other hand, compiler optimizations of fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs. This paper presents an optimization for prefetching and coalescing of shared accesses at run time. Larger messages decrease the impact of remote access latency and increase the efficiency of network communication. We have implemented our optimization for the Unified Parallel C (UPC) language. An experimental evaluation in a distributed-memory environment using a POWER7 cluster demonstrates the benefits of our optimization.
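
    The payoff of coalescing is easy to see in a small UPC sketch: a single bulk upc_memget replaces a loop of per-element remote reads. The block-distributed array and its size below are illustrative assumptions, not taken from the paper.

        #include <upc.h>

        #define B 512
        shared [B] double a[B * THREADS];   /* B consecutive elements per thread */

        /* Fine-grained: one remote read (one network round trip) per element. */
        double sum_block_fine(int owner) {
            double s = 0.0;
            for (int j = 0; j < B; j++)
                s += a[owner * B + j];
            return s;
        }

        /* Coalesced: a single bulk transfer fetches the whole remote block. */
        double sum_block_coalesced(int owner) {
            double buf[B];
            upc_memget(buf, &a[owner * B], B * sizeof(double));
            double s = 0.0;
            for (int j = 0; j < B; j++)
                s += buf[j];
            return s;
        }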

    Mechanics of rock indentation and its application to cutter performance prediction

    Available from the British Library Document Supply Centre (SIGLE record DSC:DXN061584), United Kingdom.

    Reducing compiler-inserted instrumentation in unified-parallel-C code generation

    Programs written in Partitioned Global Address Space (PGAS) languages can access any location of the entire address space via standard read/write operations. The compiler has to create the communication mechanisms, and the runtime system has to use synchronization primitives to ensure the correct execution of the programs. However, PGAS programs may contain fine-grained shared accesses that lead to performance degradation. One solution is to use the inspector-executor technique to determine which accesses are indeed remote and which accesses may be coalesced into larger remote access operations. A straightforward implementation of the inspector-executor in a PGAS system may result in excessive instrumentation that hinders performance. This paper introduces a shared-data localization transformation based on linear memory descriptors (LMADs) that reduces the amount of instrumentation introduced by the compiler into programs written in the UPC language, and describes a prototype implementation of the proposed transformation. A performance evaluation, using up to 2048 cores of a POWER 775 supercomputer, indicates that applications with regular accesses can achieve up to 180% of the performance of hand-optimized versions, while applications with irregular accesses yield speedups from 1.12× up to 6.3×.
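
    A minimal sketch of the localization idea: once the compiler can describe a contiguous, locally owned region (what a memory descriptor captures), it can hoist a single affinity check out of the loop and rewrite the body to use a private pointer, removing per-access instrumentation. The array layout and names below are assumptions for illustration only.

        #include <upc.h>

        #define N 1000
        shared [N] double v[N * THREADS];   /* one contiguous block per thread */

        void scale_owned_block(double f) {
            /* The descriptor proves v[MYTHREAD*N .. MYTHREAD*N + N - 1] is
             * local, so the shared base pointer is privatized once ...     */
            double *lv = (double *)&v[MYTHREAD * N];

            /* ... and the loop body compiles to plain loads and stores,
             * with no runtime locality check on each iteration.            */
            for (int i = 0; i < N; i++)
                lv[i] *= f;
        }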

    Combining static and dynamic data coalescing in unified parallel C

    Significant progress has been made in the development of programming languages and tools that are suitable for hybrid computer architectures that group several shared-memory multicores interconnected through a network. This paper addresses important limitations in code generation for partitioned global address space (PGAS) languages. These languages allow fine-grained communication and lead to programs that perform many fine-grained accesses to data. When the data is distributed to remote computing nodes, code transformations are required to prevent performance degradation. Until now, code transformations for PGAS programs have been restricted to cases where both the physical mapping of the data and the number of processing nodes are known at compilation time. In this paper, a novel application of the inspector-executor model overcomes these limitations and enables profitable code transformations, which result in fewer and larger messages sent through the network, when neither the data mapping nor the number of processing nodes is known at compilation time. A performance evaluation reports both scaling and absolute performance numbers on up to 32,768 cores of a POWER 775 supercomputer. This evaluation indicates that the compiler transformation results in speedups between 1.15× and 21× over a baseline, and that these automated transformations achieve up to 63 percent of the performance of the MPI versions.
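
    The key enabler is that ownership can be resolved at run time even when it is unknown at compile time. A hypothetical sketch: the inspector bins the indices of a gather by owning thread using upc_threadof(), so a runtime layer can then exchange one packed message per peer. The pointer, the helper's name, and the batching step are illustrative assumptions.

        #include <upc.h>
        #include <stdlib.h>

        shared double *data;   /* layout and THREADS fixed only at run time;
                                * assumed allocated elsewhere, e.g. with
                                * upc_all_alloc()                            */

        /* Bin the accessed indices by owning thread: bins[t] lists the
         * elements thread t owns, counts[t] how many. A caller would then
         * ship bins[t] to peer t and receive one packed reply, instead of
         * performing n single-element remote reads.                        */
        void bin_by_owner(const int *idx, int n, int **bins, int *counts) {
            int *fill = calloc(THREADS, sizeof(int));

            for (int t = 0; t < THREADS; t++)
                counts[t] = 0;
            for (int i = 0; i < n; i++)
                counts[upc_threadof(&data[idx[i]])]++;  /* ownership at run time */

            for (int t = 0; t < THREADS; t++)
                bins[t] = malloc(counts[t] * sizeof(int));
            for (int i = 0; i < n; i++) {
                int t = upc_threadof(&data[idx[i]]);
                bins[t][fill[t]++] = idx[i];
            }
            free(fill);
        }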

    Improving communication in PGAS environments: Static and dynamic coalescing in UPC

    The goal of Partitioned Global Address Space (PGAS) languages is to improve programmer productivity on large-scale parallel machines. However, PGAS programs may have many fine-grained shared accesses that lead to performance degradation. Manual code transformations or compiler optimizations are required to improve the performance of programs with fine-grained accesses. The downside of manual code transformations is the increased program complexity, which hinders programmer productivity. On the other hand, most compiler optimizations of fine-grained accesses require knowledge of the physical data mapping and the use of parallel loop constructs. This paper presents an optimization for the Unified Parallel C language that combines compile-time (static) and run-time (dynamic) coalescing of shared data, without knowledge of the physical data mapping. Larger messages increase network efficiency, and static coalescing decreases the overhead of library calls. The performance evaluation uses two microbenchmarks and three benchmarks to obtain scaling and absolute performance numbers on up to 32,768 cores of a POWER 775 machine. Our results show that the compiler transformation results in speedups from 1.15× up to 21× compared with the baseline versions, and that they achieve up to 63% of the performance of the MPI versions.
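
    Static coalescing, the compile-time half of the combination above, can be illustrated with a fixed-offset stencil: three shared reads at known relative offsets collapse into one bulk call, cutting the library-call overhead by a factor of three. The row layout is an assumption, and the sketch requires all three elements to fall inside one block (a single owner), since upc_memget cannot cross thread boundaries.

        #include <upc.h>

        #define COLS 256
        shared [COLS] double row[COLS * THREADS];   /* one row per thread */

        /* Naive form, three fine-grained shared reads:
         *     double l = row[i-1], c = row[i], r = row[i+1];
         * Statically coalesced form, one runtime call:                    */
        double stencil_at(int i) {   /* requires 1 <= (i % COLS) <= COLS-2 */
            double t[3];
            upc_memget(t, &row[i - 1], 3 * sizeof(double));
            return 0.25 * t[0] + 0.5 * t[1] + 0.25 * t[2];
        }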
